ABSTRACT

We present multiplicative updates for solving hard and soft 
margin support vector machines (SVM) with non-negative 
kernels. They follow as a natural extension of the updates 
for non-negative matrix factorization. No additional parameter
setting, such as choosing a learning rate, is required.
Experiments demonstrate rapid convergence to good classifiers.
We analyze the rates of asymptotic convergence of the up- 
dates and establish tight bounds. We test the performance on 
several datasets using various non-negative kernels and report
generalization errors equivalent to those of a standard SVM.

Index Terms — NMF, SVM, multiplicative updates 

1. INTRODUCTION 

Support vector machines (SVM) are now routinely used for 
many classification problems in machine learning [1] due to
their ease of use and ability to generalize. In the basic case, 
the input data, corresponding to two groups, is mapped into 
a higher dimensional space, where a maximum-margin hy- 
perplane is computed to separate them. The "kernel trick" 
is used to ensure that the mapping into higher dimensional 
space is never explicitly calculated. This can be formulated 
as a non-negative quadratic programming (NQP) problem and 
there are efficient algorithms to solve it [?].

SVM can be trained using variants of the gradient descent
method applied to the NQP. Although these methods
can be quite efficient [3], their drawback is the requirement
of setting the learning rate. Subset selection methods are an
alternative approach to solving the SVM NQP problem [2].
At a high level they work by splitting the arguments of the
quadratic function at each iteration into two sets: a fixed set,
where the arguments are held constant, and a working set of
the variables being optimized in the current iteration. These
methods [2], though efficient in space and time, still require a
heuristic to exchange arguments between the working and the
fixed sets.



An alternative algorithm for solving the general NQP
problem has been applied to SVM in [4]. The algorithm,
called M^3, uses multiplicative updates to iteratively converge
to the solution. It does not require any heuristics, such as
setting the learning rate or choosing how to split the argument
set.

In this paper we reformulate the dual SVM problem and
demonstrate a connection to the non-negative matrix
factorization (NMF) algorithm [?]. NMF employs multiplicative
updates and is very successful in practice due to its
independence from the learning rate parameter, low computational
complexity and ease of implementation. The new formulation
allows us to devise multiplicative updates for solving
SVM with non-negative kernels (the output value of the kernel
function is greater than or equal to zero). The requirement of
a non-negative kernel is not very restrictive, since the set of
such kernels includes many popular ones, such as Gaussian
kernels and polynomial kernels of even degree. The new updates
possess all of the good properties of the NMF algorithm, such as
independence from hyper-parameters, low computational complexity
and ease of implementation. Furthermore, the new algorithm
converges faster than the previous multiplicative solution of
the SVM problem from [4], both asymptotically (a proof is
provided) and in practice. We also show how to solve the
SVM problem with soft margin using the new algorithm.

2. NMF 

We present a brief introduction to NMF mechanics with the
notation that is standard in the NMF literature. NMF is a tool
to split a given non-negative data matrix into a product of
two non-negative matrix factors [?]. The constraint of
non-negativity (all elements are ≥ 0) usually results in a
parts-based representation and is different from other
factorization techniques which result in more holistic
representations (e.g. PCA and VQ).

Given a non-negative $m \times n$ matrix $X$, we want to
represent it with a product of two non-negative matrices $W$, $H$
of sizes $m \times r$ and $r \times n$ respectively:

$$X \approx WH. \tag{1}$$



Lee and Seung [?] describe two simple multiplicative updates
for $W$ and $H$ which work well in practice. These correspond
to two different cost functions representing the quality
of approximation. Here, we use the Frobenius norm for the
cost function. The cost function and the corresponding
multiplicative updates are:

$$E = \frac{1}{2}\|X - WH\|_F^2 \tag{2}$$

$$W = W \odot \frac{XH^T}{WHH^T}, \qquad H = H \odot \frac{W^TX}{W^TWH} \tag{3}$$

where $\|\cdot\|_F$ denotes the Frobenius norm and the operator
$\odot$ represents element-wise multiplication. Division is also
element-wise. It should be noted that the cost function to be
minimized is convex in either $W$ or $H$ but not in both [?].
In [?] it is proved that when the algorithm iterates using
the updates (3), $W$ and $H$ monotonically decrease the cost
function.
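
As an illustration of updates (3), the following is a minimal NumPy sketch, not the authors' code. The function name `nmf_multiplicative`, the random initialization, the iteration count and the small constant `eps` added to the denominators (to avoid division by zero) are assumptions made for this example.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-12, seed=0):
    """Approximate a non-negative X (m x n) by W (m x r) times H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Strictly positive random start (an assumption; any positive start works).
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    costs = []
    for _ in range(n_iter):
        # Element-wise multiplicative updates for the Frobenius cost.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        costs.append(0.5 * np.linalg.norm(X - W @ H, "fro") ** 2)
    return W, H, costs

# Toy usage on a random non-negative matrix.
rng = np.random.default_rng(1)
X = rng.random((20, 15))
W, H, costs = nmf_multiplicative(X, r=4)
```

The list `costs` records the value of the cost after each pair of updates, so the monotonic decrease can be inspected directly.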

3. SVM AS NMF

Let the set of labeled examples $\{(x_i, y_i)\}_{i=1}^{n}$ with binary
class labels $y_i = \pm 1$ correspond to two classes denoted by $A$
and $B$ respectively. Let the mapping $\Phi(x_i)$ be the representation
of the input datapoint $x_i$ in the space $\Phi$, where we denote the
space by the name of the mapping function performing the
transformation. We now consider the problem of computing
the maximum margin hyperplane for SVM in the case where
the classes are linearly separable and the hyperplane passes
through the origin.

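The dual quadratic optimization problem for SVM [1] is
given by minimizing the following loss function:

$$\min_{\alpha} S(\alpha) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^{n} \alpha_i \tag{4}$$

subject to $\alpha_i \ge 0$, $i \in \{1..n\}$,

where $k(x_i, x_j)$ is a kernel that computes the inner product
$\Phi(x_i)^T\Phi(x_j)$ in the space $\Phi$ by performing all operations
only in the original data space on $x_i$ and $x_j$, thus defining a
Hilbert space $\Phi$.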

The first sum can be split into three terms: two terms contain
kernels of elements that belong to the same respective
class (one term per class), and the third contains only the
kernel between elements of the two classes. This rearrangement
of terms allows us to drop class labels $y_i, y_j$ from the
objective function. Denoting $k(x_i, x_j)$ by $k_{ij}$ and defining
$p_{ij} = \alpha_i \alpha_j k_{ij}$ for conciseness, we get:

$$\min_{\alpha} \frac{1}{2}\left(\sum_{i,j \in A} p_{ij} - 2\sum_{i \in A,\, j \in B} p_{ij} + \sum_{i,j \in B} p_{ij}\right) - \sum_{i=1}^{n} \alpha_i. \tag{5}$$

Noticing the square and the fact that $k_{ij} = \Phi(x_i)^T\Phi(x_j)$, we
rewrite the problem as:

$$\min_{\alpha} \frac{1}{2}\|\Phi(X_A)\alpha_A - \Phi(X_B)\alpha_B\|_2^2 - \sum_{i \in \{A,B\}} \alpha_i \tag{6}$$

subject to $\alpha_i \ge 0$,



where the matrices $X_A$, $X_B$ contain the datapoints
corresponding to groups $A$ and $B$ respectively with the stacking
being column-wise. The map $\Phi$ applied to a matrix corresponds
to mapping each individual column vector of the matrix
using $\Phi$ and stacking them to generate the new matrix.
The vectors $\alpha_A$, $\alpha_B$ contain coefficients of the support
vectors of the two groups $A$, $B$ respectively. We will use the
vector $\alpha$ to denote the concatenation of vectors $\alpha_A$, $\alpha_B$.
Expression (6) resembles NMF with an additional term in the
objective [?]. The above formulation enables other metrics
$D(\Phi(X_A)\alpha_A \,\|\, \Phi(X_B)\alpha_B)$ than least squares for SVM, such
as the more general Bregman divergence [?]. However, to be
computationally efficient the metric used has to admit the use
of the kernel trick.
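
The equivalence between the label-based dual objective and expression (6) can be checked numerically. The sketch below assumes a linear kernel, so that the feature map is the identity and both forms can be evaluated explicitly. The toy data, the sizes and all variable names are illustrative only. Class A is taken to carry the label +1 and class B the label -1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_a, n_b = 3, 4, 5

# Columns are datapoints, following the column-wise stacking in the text.
X_A = rng.normal(size=(d, n_a))   # class A, label +1
X_B = rng.normal(size=(d, n_b))   # class B, label -1
alpha_A = rng.random(n_a)
alpha_B = rng.random(n_b)

# Label-based dual objective with a linear kernel k(x_i, x_j) = x_i^T x_j.
X = np.hstack([X_A, X_B])
y = np.concatenate([np.ones(n_a), -np.ones(n_b)])
alpha = np.concatenate([alpha_A, alpha_B])
K = X.T @ X
dual = 0.5 * (alpha * y) @ K @ (alpha * y) - alpha.sum()

# Label-free form: half the squared distance between the two weighted sums.
nmf_form = 0.5 * np.linalg.norm(X_A @ alpha_A - X_B @ alpha_B) ** 2 - alpha.sum()

print(dual, nmf_form)
```

Both printed values agree up to floating point error, since the cross-class terms carry the factor $y_i y_j = -1$.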

4. MULTIPLICATIVE ALGORITHM 

In this paper, we focus on kernel functions which are
non-negative. A kernel function is non-negative when its output
value is greater than or equal to zero for all possible inputs in
its domain. We note that quite a few of the commonly used
kernels are non-negative, such as Gaussian kernels and polynomial
kernels of even degree. We take the derivative of the objective $S$
with respect to $\alpha_A$:

$$\frac{\partial S}{\partial \alpha_A} = \Phi(X_A)^T\Phi(X_A)\alpha_A - \Phi(X_A)^T\Phi(X_B)\alpha_B - 1 = K(X_A, X_A)\alpha_A - \left(K(X_A, X_B)\alpha_B + 1\right).$$

We slightly abuse notation to define a matrix kernel as follows:
$K(C, D)$ is given by the matrix whose $(i, j)$th element
is given by the inner product of the $i$th and $j$th datapoints of
matrices $C$, $D$ respectively in the feature space $\Phi$, for all values
of $i, j$ in range. We note that the derivative has a positive
and a negative component. Similarly, we take the derivative
with respect to $\alpha_B$. Recalling the updates for NMF from the
previous section, we write down the multiplicative updates for
this problem:



OiA = cxaQ 



OtB = 



K{XA,XB)aB + l 

K{XA,XA)aA 
K{Xb,Xa)ola + 1 
K{XB,XB)aB 



(7) 



subject to ai > 0, i e {l..n}. 



where 1 is an appropriately sized vector of ones and de- 
notes Hadamard product as before. We call this new algo- 
rithm Multiplicative Updates for Non-negative Kernel SVM 
(MUNK). 
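
A minimal NumPy sketch of the updates (7) is given below for illustration; it is not the authors' implementation. It assumes a Gaussian kernel, an all-ones initialization, a fixed iteration count and alternating updates in which the second group uses the freshly updated coefficients of the first. The helper names `gaussian_kernel`, `objective` and `munk` and the toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(P, Q, sigma=1.0):
    """Gaussian kernel matrix between the columns of P (d x p) and Q (d x q)."""
    sq = (P**2).sum(axis=0)[:, None] + (Q**2).sum(axis=0)[None, :] - 2.0 * P.T @ Q
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def objective(K_AA, K_AB, K_BB, a_A, a_B):
    """Objective (6) expressed through kernel matrices only."""
    quad = a_A @ K_AA @ a_A - 2.0 * (a_A @ K_AB @ a_B) + a_B @ K_BB @ a_B
    return 0.5 * quad - a_A.sum() - a_B.sum()

def munk(X_A, X_B, sigma=1.0, n_iter=200):
    """Alternating multiplicative updates (7) for a non-negative kernel."""
    K_AA = gaussian_kernel(X_A, X_A, sigma)
    K_AB = gaussian_kernel(X_A, X_B, sigma)
    K_BB = gaussian_kernel(X_B, X_B, sigma)
    a_A = np.ones(X_A.shape[1])   # assumed initialization
    a_B = np.ones(X_B.shape[1])
    history = []
    for _ in range(n_iter):
        a_A = a_A * (K_AB @ a_B + 1.0) / (K_AA @ a_A)
        a_B = a_B * (K_AB.T @ a_A + 1.0) / (K_BB @ a_B)
        history.append(objective(K_AA, K_AB, K_BB, a_A, a_B))
    return a_A, a_B, history

# Toy usage: two Gaussian blobs, columns are datapoints.
rng = np.random.default_rng(0)
X_A = rng.normal(loc=1.5, size=(2, 20))
X_B = rng.normal(loc=-1.5, size=(2, 20))
a_A, a_B, history = munk(X_A, X_B)
```

No learning rate appears anywhere in the loop; the only inputs besides the data are the kernel parameter and the number of iterations.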



The convergence of the above updates follows from the
proof of convergence of the regular NMF updates [?].
Furthermore, since the Hessian of the joint problem of estimating
$\alpha_A$ and $\alpha_B$ is positive semi-definite, the alternating updates
have no local minima, only the global minimum.

5. SOFT MARGIN 

We can extend the multiplicative updates to incorporate upper
bound constraints of the form $\alpha_i \le l$, where $l$ is a constant, as
follows:

$$\alpha_i = \min\{\alpha_i, l\} \tag{8}$$

These are referred to as box constraints, since they bound $\alpha_i$
from both above and below.

The dual problem for soft margin SVM is given by:

$$\min_{\alpha} S(\alpha) \quad \text{subject to } 0 \le \alpha_i \le C,\ i \in \{1..n\}. \tag{9}$$

The parameter $C$ is a regularization term, which provides a
way to avoid overfitting. Soft margin SVM involves box
constraints that can be handled by the above formulation. At each
update of $\alpha$, we implement a step given by (8) to ensure the
box constraint is satisfied. This corresponds to potentially
reducing the step size of the multiplicative update of an element,
and since the problem is convex this will still guarantee a
monotonic decrease of the objective.
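
The clipping step (8) can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: the function name `soft_margin_step`, the Gaussian kernel with unit width, the toy points, the value of `C` and the initialization at the upper bound are all choices made here.

```python
import numpy as np

def soft_margin_step(K_AA, K_AB, K_BB, a_A, a_B, C):
    """One multiplicative update per group, each followed by clipping at C."""
    a_A = np.minimum(a_A * (K_AB @ a_B + 1.0) / (K_AA @ a_A), C)
    a_B = np.minimum(a_B * (K_AB.T @ a_A + 1.0) / (K_BB @ a_B), C)
    return a_A, a_B

# Toy problem: 12 random points as columns, first 6 in group A, last 6 in group B.
rng = np.random.default_rng(0)
P = rng.normal(size=(2, 12))
sq = ((P[:, :, None] - P[:, None, :]) ** 2).sum(axis=0)
K = np.exp(-sq / 2.0)                      # Gaussian kernel, sigma = 1 (assumed)
K_AA, K_AB, K_BB = K[:6, :6], K[:6, 6:], K[6:, 6:]

C = 0.5
a_A = np.full(6, C)                        # start inside the box
a_B = np.full(6, C)
for _ in range(100):
    a_A, a_B = soft_margin_step(K_AA, K_AB, K_BB, a_A, a_B, C)
```

Because the multiplicative factor is always positive, the lower bound is preserved automatically and only the upper bound needs the explicit `np.minimum`.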

6. ASYMPTOTIC CONVERGENCE 

Sha et al. [4] observed a rapid decay of non-support vector
coefficients in the M^3 algorithm and performed an analysis of
the rate of asymptotic convergence. They perturb one of the
non-support vector coefficients, e.g. $\alpha_i$, away from the fixed
point to some nonzero value $\delta\alpha_i$ and fix all the remaining
values. Applying their multiplicative update gives a bound on
the asymptotic rate of convergence.

Let $d_i = K(x_i, w)/\sqrt{K(w, w)}$ denote the perpendicular
distance in the feature space from $x_i$ to the maximum margin
hyperplane and $d = \min_i d_i = 1/\sqrt{K(w, w)}$ denote the
one-sided margin to the maximum-margin hyperplane. Also,
$l_i = \sqrt{K(x_i, x_i)}$ denotes the distance of $x_i$ to the origin
in the feature space and $l = \max_i l_i$ denotes the largest such
distance. The following bound on the asymptotic rate of
convergence $\gamma_i^{M^3}$ was established:



$$\frac{1}{\gamma_i^{M^3}} \ge 1 + \frac{1}{2}\frac{(d_i - d)d}{l_i l} \tag{10}$$



We perform a similar analysis of the rate of asymptotic
convergence of the multiplicative updates of the MUNK
algorithm. We perturb one of the non-support vector coefficients,
fixing all the other coefficients, and apply the multiplicative
update. This enables us to calculate a bound on the rate of
convergence. A bound on the asymptotic rate of convergence in
terms of geometric quantities is given as follows:

Table 1. Misclassification rates (%) on the breast cancer and
sonar datasets after convergence of the M^3, MUNK (M) and
Kernel Adatron (KA) algorithms. Polynomial kernels of
degree 4 and 6 and Gaussian kernels with $\sigma$ = 1 and 3 were used.

The proof sketch can be found in the appendix. We note
that our bound is tighter than that of the M^3 algorithm as

MUNK < 

a — a • 

7. EXPERIMENTS 

In order to demonstrate the practical applicability of the
theoretical properties proved in the previous section, we test the
above updates on two real world problems: the breast cancer
dataset and the aspect-angle dependent sonar signals from the
UCI Repository [6]. They contain 683 and 208 labelled examples
respectively. The breast cancer dataset was split into
80% and 20% for training and test sets respectively. The sonar
dataset was equally divided into training and test sets. The
vectors $\alpha$ were initialized identically in all algorithms.
Different kernels involving polynomial and radial basis functions
were applied to the datasets. For comparison we also provide
results for the M^3 and Kernel-Adatron (KA) [3] algorithms.
Misclassification rates on the test datasets are shown
in Table 1. They match previously reported error rates on this
dataset [?].

These results support our derivations and demonstrate that
the algorithm can be used for training SVM with non-negative
kernels. However, since the problem is convex and there
exists a unique solution, all correct algorithms will converge to
the same solution and arrive at the same classification error
rates.

MUNK is slightly faster per iteration than M^3 due to an
extra square root and multiplication per training pattern in the
M^3 algorithm. We ignore that slight difference and plot the
objective function per iteration of the MUNK and M^3 algorithms
on the Breast and Sonar sets in Figure 1. The result agrees
with the theoretically shown upper bound: MUNK converges
about twice as fast as M^3.

Fig. 1. Convergence of the objective with iterations when
training with Gaussian kernel ($\sigma = 3$). Lower curve means
faster convergence. Note that the x axis is logarithmic, indicating
a multiplicative speedup for MUNK over a wide operating
range.



8. CONCLUSIONS 

We have derived simple multiplicative update rules for solving
the maximum-margin classifier problem in SVMs with
non-negative kernels. No additional parameter tuning is
required and convergence is guaranteed. The updates are
straightforward to implement. The updates could also be
used as part of a subset method, which could potentially speed
up the MUNK algorithm. MUNK shares the utility of the M^3
algorithm in that it is easy to implement in higher-level
languages like MATLAB with application to small datasets. It
also shares the drawback of M^3 in its inability to directly
set a variable to zero. However, we have shown MUNK to
have an asymptotically faster rate of convergence compared
to the M^3 algorithm, and we believe this provides a motivation
for further research in multiplicative updates for support
vector machines. Also, the derivation was constructed in such a
way that it highlights the connection between SVM with a
non-negative kernel and NMF. Since multiplicative updates
emerge in different settings and algorithms, it might be
interesting to find the pattern of when such updates are possible
and how to automatically derive them. Our presentation of
the NMF and SVM correspondence can be considered a step
towards this direction.
Appendix

Let the fixed point be $\alpha^*$ and let $K(X_A, X_A)\alpha_A^*$ be denoted by
$z^+$ and $K(X_A, X_B)\alpha_B^*$ by $z^-$. If we choose an $i$th
non-support vector coefficient from $\alpha_A$, then we have $z_i^+ - z_i^- > 1$.
Let the multiplicative factor be denoted by $\gamma_i$. We then
have:



$$\frac{1}{\gamma_i} = \frac{z_i^+}{z_i^- + 1} \ge 1 + \frac{K(x_i, w) - 1}{z_i^+},$$



where $w = \sum_i \alpha_i^* x_i y_i$ is the normal vector to the maximum
margin hyperplane. We have used the following:

$$z_i^+ - z_i^- = \sum_{j \in A} k_{ij}\alpha_j^* - \sum_{k \in B} k_{ik}\alpha_k^* = K(x_i, w),$$

where $k_{ij} = K(x_i, x_j)$.

We now obtain a bound on the denominator:

$$z_i^+ = \sum_{j \in A} K(x_i, x_j)\alpha_j^* \le \max_k K(x_i, x_k) \sum_{j \in A} \alpha_j^* \le \sqrt{K(x_i, x_i)}\, \max_k \sqrt{K(x_k, x_k)}\, K(w, w).$$

We have used the Cauchy-Schwarz inequality for kernels and
an upper bound for the sum of the elements of the vector $\alpha_A^*$.

We do a similar analysis by perturbing an $i$th non-support
vector coefficient from group $B$. Combining the analysis, the
lower bound is: